Add more tool call integration tests and improve integration test infrastructure#142
Add more tool call integration tests and improve integration test infrastructure#142atdrendel wants to merge 6 commits intoml-explore:mainfrom
Conversation
|
In #118, all of the integration tests were moved to a separate module that can be imported to run in the adapter packages. So either this gets merged first and the new/revised tests need to be re-implemented there, or that gets merged first and the tests need to be implemented in the new way. |
|
I'm happy to have #118 get merged first, and then I'll re-add these tests in the correct spot. |
f888b37 to
877cf57
Compare
There was a problem hiding this comment.
Pull request overview
Refactors MLXLM integration tests to use a shared async model-loading actor (instead of per-test-suite static state) and expands tool-call integration coverage to additional models.
Changes:
- Centralized model IDs and added cached async container loaders in
IntegrationTestModels. - Updated tool-call integration tests to lazily load model containers via
IntegrationTestModelswith skip-on-unavailable behavior. - Added new Nemotron and Qwen3.5 tool-call auto-detection and end-to-end parsing tests.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| Tests/MLXLMIntegrationTests/ToolCallIntegrationTests.swift | Reworked model loading to use shared async loaders; added Nemotron + Qwen3.5 tool-call tests. |
| Tests/MLXLMIntegrationTests/IntegrationTestModels.swift | Added additional model IDs and per-model cached loading methods (via stored Tasks) for integration tests. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
Proposed changes
I've noticed that tool calling doesn't always get tested when adding support for new models to mlx-swift-lm. Additionally, modifying the tool calling code is, at least to me, a bit stressful because we don't have robust tool calling integration tests proving code changes don't break tool calling for older models.
This pull request hopes to start improving our support for tool call testing going forward by:
IntegrationTestModelsto support the new models being testedIntegrationTestModelsinside ofToolCallIntegrationTeststo allow for lazy loading of the models (meaning that developers who want to add a new model can just download the model they need if they only want to run the integration tests for that single model)Note: The Nemotron integration tests are skipped by default because the model is huge. However, enabling the tests involves just swapping
falsefortrueinXCSkipIf(). This seems like a sensible choice to me because of the size of the model, but I'd be curious what others think.Here are a few of the recent pull requests made to address broken tool calling in some popular models:
Checklist
Put an
xin the boxes that apply.pre-commit run --all-filesto format my code / installed pre-commit prior to committing changes